Design and Evaluation of a Fault-Tolerant Adaptive Router for Parallel Computers
نویسندگان
چکیده
In this paper, we propose a design methodology for faulttolerant adaptive routers for parallel and distributed computers. The key idea of our method is integrating minimal and non-minimal routing that is supported by independent virtual channels (VCs). Distinguishing the routing functions for each set of VCs simplifies the design of fault-tolerant algorithms. After describing the method, we show an application of a routing algorithm for two-dimensional mesh and torus networks. This algorithm, called Detour-NF, supports three routing modes: deterministic, minimal fully adaptive and non-minimal fault-tolerant operations. We also discuss the hardware cost and operational speed of minimal and non-minimal routers based on our design, which uses hardware description language (HDL). Communication performance and fault-tolerance are demonstrated by an HDL simulation. The experimental results show that supporting both minimal and non-minimal routing modes is advantageous for high-bandwidth and low-latency communication, as well as fault-tolerance.
منابع مشابه
Rule-Based Routing for Fault-Tolerant Parallel Computers
Routing has an important influence on the performance of interconnection networks in parallel computers. Besides simple oblivious schemes like xy-routing for 2D grids there exist a lot of sophisticated adaptive and fault-tolerant routing algorithms which could however not be implemented so far, because there are no fast hardware routers which are able to support them. In this paper such a flexi...
متن کاملCAFT: Cost-aware and Fault-tolerant routing algorithm in 2D mesh Network-on-Chip
By increasing, the complexity of chips and the need to integrating more components into a chip has made network –on- chip known as an important infrastructure for network communications on the system, and is a good alternative to traditional ways and using the bus. By increasing the density of chips, the possibility of failure in the chip network increases and providing correction and fault tol...
متن کاملDAMQ-Based Approach for Efficiently Using the Buffer Spaces of a NoC Router
In this paper we present high performance dynamically allocated multi-queue (DAMQ) buffer schemes for fault tolerance systems on chip applications that require an interconnection network. Two virtual channels shared the same buffer space. Fault tolerant mechanisms for interconnection networks are becoming a critical design issue for large massively parallel computers. It is also important to hi...
متن کاملOn Feasibility of Adaptive Level Hardware Evolution for Emergent Fault Tolerant Communication
A permanent physical fault in communication lines usually leads to a failure. The feasibility of evolution of a self organized communication is studied in this paper to defeat this problem. In this case a communication protocol may emerge between blocks and also can adapt itself to environmental changes like physical faults and defects. In spite of faults, blocks may continue to function since ...
متن کاملAn approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to design of fault tolerant computing systems. In this paper, a technique is employed that enable the combination of several codes, in order to obtain flexibility in the design of error correcting codes. Code combining techniques are very effective, which one of these codes are turbo codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...
متن کامل